Approximate String Matching in Sublinear Expected Time
Abstract
The k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, and the number k of differences (insertions, deletions, substitutions) allowed in a match, and asks for every location in the text where a match occurs. Previous algorithms require at least O(nk) time. When k is as large as a fraction of m, no substantial progress has been made over O(nm) dynamic programming. We are interested in much faster algorithms for restricted cases of the problem, such as when the text string is random and errors are not too frequent. We have devised an algorithm that, for k < m/(log m + O(1)), runs in time O((n/m) k log m) on average. In the worst case, our algorithm is O(nk), but it is still an improvement in that it is very practical and uses only O(m) space, compared to O(n) or O(m^2). We define the approximate substring matching problem and give efficient algorithms based on our techniques. Special cases include several applications to genetics and molecular biology. For example, even allowing errors, we can find long common blocks of the text and pattern (local similarities), or select from among a set of text fragments those that overlap one end of the pattern (sequence assembly). These are common tasks in DNA sequence analysis but are expensive to perform using previous techniques.
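For orientation, the O(nm) dynamic programming baseline that the abstract compares against can be sketched as follows. This is the classical column-by-column formulation of the k differences problem, not the sublinear algorithm of the paper; the function name k_differences_end_positions and the choice to report 1-based end positions are assumptions made for illustration.

```python
def k_differences_end_positions(pattern, text, k):
    """Classical O(nm)-time, O(m)-space dynamic programming baseline.

    Returns the 1-based positions j in `text` such that some substring
    ending at j is within k differences (insertions, deletions,
    substitutions) of `pattern`. Hypothetical helper for illustration.
    """
    m = len(pattern)
    # col[i] = minimum number of differences between pattern[:i] and
    # any substring of the text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # value for (i-1, j-1)
        # col[0] stays 0 because a match may start at any text position.
        for i in range(1, m + 1):
            cur = col[i]  # value for (i, j-1), saved before overwrite
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         cur + 1,           # extra text character
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = cur
        if col[m] <= k:
            ends.append(j)
    return ends
```

Only one column of the table is kept at a time, which is why this sketch needs O(m) space while still spending O(nm) time; the abstract's contribution is to avoid examining most of the text on average.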
Similar resources
Approximate Pattern Matching with Samples
In this paper we simplify the algorithm by Chang and Lawler for the approximate string matching problem by adopting the concept of sampling. We give a more general analysis of expected time with the simplified algorithm for the one-dimensional case under a non-uniform probability distribution, and we show that our method can easily be generalized to the two-dimensional approximate pattern match...
Sublinear Approximate String Matching
The present paper deals with the subject of approximate string matching and demonstrates how Chang and Lawler [CL94] conceived a new sublinear time algorithm out of ideas that had previously been known. The problem is to find all locations in a text of length n over a b-letter alphabet where a pattern of length m occurs with up to k differences (substitutions, insertions, deletions). The algori...
Approximate String Matching using Backtracking over Suffix Arrays
We describe a simple backtracking algorithm that finds approximate matches of a pattern in a large indexed text. This algorithm theoretically takes sublinear time in the length of the text. We prove a lemma that helps us to prune a significant number of search branches in practice. We show an implementation of a variant of this algorithm that is used to find similar regions between seque...
All-Against-All Sequence
In this paper we present an algorithm which attempts to align pairs of subsequences from a database of DNA sequences. The algorithm simulates the classical dynamic programming alignment algorithm over a digital index of the database. The running time of the algorithm is subquadratic on average with respect to the database size. A similar algorithm solves the approximate string matching problem ...
Approximate String Matching with Ordered q-Grams
Approximate string matching with k differences is considered. Filtration of the text is a widely adopted technique to reduce the text area processed by dynamic programming. We present sublinear filtration algorithms based on the locations of q-grams in the pattern. Samples of q-grams are drawn from the text at fixed periods, and only if consecutive samples appear in the pattern approximately in...